
use optional dil tensor & move semantics #24


Merged: 2 commits into intel:master on May 26, 2020
Conversation

@pinzhenx (Contributor) commented on May 25, 2020

This PR optimizes the dispatcher for aten ops:

  • optimize out dummy dil tensor creation in ShadeDataContext

We noticed that creating the dummy DIL tensor in ShadeDataContext costs a significant amount of time, so this PR changes the dil_tensor field to an optional (see the sketch after the code example below).

  • use move semantics in gen_aten_tensor_by

This avoids an unnecessary shallow copy of the tensor. Callers now hand the DIL tensor over with std::move:

at::Tensor AtenIpexCPUDev::foo(const at::Tensor& x) {
  dil::tensor y;
  ...
  // Move y into the new aten tensor instead of shallow-copying it.
  return dbl::comm::gen_aten_tensor_by(std::move(y));
}
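To see why the move matters in isolation, here is a self-contained sketch. This is not IPEX code: Handle and wrap are stand-ins for a dil::tensor-like handle and a by-value sink such as gen_aten_tensor_by, where even a "shallow" copy still bumps a shared refcount.

#include <iostream>
#include <memory>
#include <utility>

// Stand-in for dil::tensor: copying the handle is shallow but still
// touches a shared refcount, which is the cost the PR avoids.
struct Handle {
  std::shared_ptr<int> buf = std::make_shared<int>(0);
};

// Taking the handle by value lets the caller choose copy vs. move,
// mirroring the gen_aten_tensor_by call pattern above.
Handle wrap(Handle h) { return h; }

int main() {
  Handle a;
  Handle b = wrap(a);                      // lvalue: one shallow copy
  std::cout << a.buf.use_count() << "\n";  // 2 (a and b share the buffer)
  Handle c = wrap(std::move(a));           // rvalue: moved through, no copy
  std::cout << c.buf.use_count() << "\n";  // still 2 (b and c; a was emptied)
  return 0;
}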
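And for the first bullet, a minimal sketch of the optional-field change. The type and field names below are stand-ins, not the actual ShadeDataContext definition, and the real code may use c10::optional rather than std::optional:

#include <optional>

// Stand-in for dil::tensor with a non-trivial default constructor.
struct DilTensor {
  DilTensor() { /* imagine expensive descriptor setup here */ }
};

// Before: every context default-constructs a dummy DIL tensor,
// even for plain CPU tensors that never need one.
struct ShadeDataContextBefore {
  DilTensor dil_tensor;
};

// After: the field is optional, so the construction cost is paid
// only when a real DIL buffer is attached.
struct ShadeDataContextAfter {
  std::optional<DilTensor> dil_tensor;  // empty by default
};

int main() {
  ShadeDataContextBefore before;  // pays the dummy construction
  ShadeDataContextAfter after;    // pays nothing up front
  after.dil_tensor.emplace();     // construct only when needed
  return 0;
}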

This PR reduces the dispatcher overhead from 90%+ to 20%+. The results were gathered on a single thread with jemalloc enabled. The remaining gap mostly comes from shallowUpgradeToDPCPPTensor, which is where we need to optimize next.

Here's the benchmark script.

import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ipex", action="store_true", default=False)
parser.add_argument('--num-warmup-runs', type=int, default=10)
parser.add_argument('--num-main-runs', type=int, default=100000)
args = parser.parse_args()

x = torch.rand(1, 1)
y_ref = x.relu()
if args.ipex:
    print("# USE IPEX")
    import _torch_ipex
    _torch_ipex._initialize_aten_bindings()
    x = x.to('dpcpp')
else:
    print("# NO IPEX")


# warm up before profiling
for _ in range(args.num_warmup_runs):
    y = x.relu()

# profile the main loop
with torch.autograd.profiler.profile(True) as prof:
    for _ in range(args.num_main_runs):
        y = x.relu()

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
assert torch.equal(y_ref, y)
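Assuming the script is saved as bench.py (a hypothetical filename), running python bench.py profiles the stock path and python bench.py --ipex profiles the optimized dispatcher.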

@pinzhenx (Contributor, Author) commented

@EikanWang @hongzhen1

@EikanWang (Contributor) commented

LGTM

pinzhenx force-pushed the optional branch 2 times, most recently from af2f809 to 0db0e60 on May 25, 2020 05:38
@EikanWang (Contributor) commented

@hongzhen1 r u okay with this optimization? I will merge this PR first in case some new modifications will break this.

@hongzhen1 (Contributor) commented

> @hongzhen1 r u okay with this optimization? I will merge this PR first in case some new modifications will break this.

LGTM

pinzhenx marked this pull request as draft on May 25, 2020 14:41
pinzhenx marked this pull request as ready for review on May 25, 2020 15:38
EikanWang merged commit a79be1b into intel:master on May 26, 2020
EikanWang pushed a commit that referenced this pull request Oct 4, 2021
* enable fp32 lstm in cpu device

* lstm enable bf16

* Implement unit test

* add gather into black list

* Remove unnecessary lines and move test case position

* hook at module level

* copy _flat_weights into IpexLSTM # model.bias_ih_l0 will be incorrect

* add fp32 unit test

* refactor LSTM UT

* update comments

Co-authored-by: chunyuan <chunyuan.wu@intel.com>